This is a course diary for Introduction to Open Data Science, a well-known course organised at the University of Helsinki in autumn 2017. My personal repository is here.
I am working with R version 3.2.5 (2016-04-14, nicknamed "Very, Very Secure Dishes") and RStudio version 1.0.136. I found it helpful to enlarge the GUI a bit and change the theme to Merbivore: go to Tools > Global Options > Appearance and adjust Zoom and Editor theme there. This course diary uses the R Markdown theme called journal; run ?html_dependency_bootstrap to see the list of available themes. The font size had to be increased by adding these lines to index.Rmd:
<style type="text/css">
body{ /* Normal */
font-size: 20px;
}
</style>
First I did some data wrangling, i.e. formatted a dataset so that it is suitable for a later analysis, in this case a regression analysis. The wrangling was done in the R script file create_learning2014.R and the formatted dataset is in the file learning2014.csv. As described in the wrangling script, the data was originally collected for an international survey of Approaches to Learning. For details and links to further information, have a look at create_learning2014.R in my repository.
After practising and doing part of the exercise, the RStudio workspace became a bit cluttered. I found that the function remove() (alias rm()) can be used to remove objects from memory; it is handy for cleaning up the environment once in a while.
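A minimal sketch of this (the object names here are just illustrative):

```r
# create a couple of throwaway objects
tmp_vector <- c(1, 2, 3)
tmp_sum <- sum(tmp_vector)

ls()                        # list the objects currently in the environment
remove(tmp_vector, tmp_sum) # same as rm(); removes the named objects
# remove(list = ls())       # this would clear the whole environment
```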
Another important thing is that the extra libraries have to be loaded every time RStudio is started, even though the files .RData and .Rhistory restore the workspace objects and command history from where you were when quitting last time.
Let’s first read the data set into memory and have a look at it. Summaries of the variables are below:
#getwd()
#NOTE: some functions are commented out to make the knitted output more simple.
learning2014 <- read.csv("data/learning2014.csv", header = TRUE)
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.667 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
It consists of 166 observations of 7 variables. The variables deep (deep learning), stra (strategic learning) and surf (surface learning) represent three different dimensions of learning approach. They were calculated as means of several other variables during the earlier data wrangling. These variables are on a Likert scale and are based on data originally obtained from a questionnaire.
There is noticeable skewness in age at the highest quartile, shifting the mean age away from the median towards the older end of the distribution. This can be seen clearly from the histogram below:
library(ggplot2)
p_age <- ggplot(learning2014, aes(age)) + geom_histogram(binwidth = 1, fill="purple", colour="red") + ggtitle("Histogram of Student's Ages")
p_age
Scatter plots of all the variables, using gender as a factor, are presented in a matrix below. This first picture is shown mainly to demonstrate how much better plots can be produced with the GGally and ggplot2 libraries; those follow below it:
pairs(learning2014[-1], col=learning2014$gender)
In this scatter plot matrix black dots mark females and red dots mark males.
GGally and ggplot2 produce another, more informative scatter plot matrix:
library(ggplot2)
library(GGally)
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
In this scatter plot matrix red marks females and blue marks males.
For the following regression analysis the exam points will be the target (aka dependent, response) variable, and three independent (aka explanatory) variables are chosen: attitude, surf and deep. Here is the regression model with multiple explanatory variables:
my_model <- lm(points ~ attitude + surf + deep, data = learning2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + surf + deep, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9168 -3.1487 0.3667 3.8326 11.3519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.35507 4.71237 3.895 0.000143 ***
## attitude 0.34661 0.05766 6.011 1.18e-08 ***
## surf -1.09114 0.83599 -1.305 0.193669
## deep -0.94846 0.79025 -1.200 0.231815
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1876
## F-statistic: 13.7 on 3 and 162 DF, p-value: 5.217e-08
From the summary above it can be seen that only attitude has a statistically significant relationship with exam points (p = 1.18e-08, so the risk that this conclusion is wrong is extremely low). The standard error (0.05766) of the regression coefficient for attitude is not an order of magnitude smaller than the coefficient itself (0.34661), indicating some variability in the estimate. Multiple R-squared is about 20 %, meaning that only about 20 % of the variability in exam points is explained by these three variables. The p-value of the F-statistic indicates whether the model would be better with fewer variables: in that case the p-value would be higher. Here the p-value is 5.217e-08; removing surf from the model gives a slightly lower p-value of 2.326e-08, and removing deep as well gives an even lower p-value of 4.119e-09.
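The comparison of nested models described above can also be done directly with an F-test via anova(); a sketch using the same data:

```r
model_full <- lm(points ~ attitude + surf + deep, data = learning2014)
model_reduced <- lm(points ~ attitude, data = learning2014)

# F-test of the null hypothesis that surf and deep add nothing to the model;
# a large p-value supports dropping them
anova(model_reduced, model_full)
```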
After testing the other variables as well and checking the p-values from the F-tests, it can be concluded that the best model has only attitude as an explanatory variable. According to the simplified model below, a student has to raise attitude by about three points in order to increase exam points by one point. The intercept suggests that even a student with zero attitude would get about 12 exam points (11.64 from the simplified model below), which seems quite reasonable.
my_best_model <- lm(points ~ attitude, data = learning2014)
summary(my_best_model)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
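The fitted simplified model can be used to check this effect numerically; a sketch with a few illustrative attitude values:

```r
# predicted exam points for attitude values three units apart
new_data <- data.frame(attitude = c(20, 23, 26))
predict(my_best_model, newdata = new_data)
# each three-unit rise in attitude adds roughly 3 * 0.35, i.e. about 1 exam point
```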
Let’s do some graphical model validation using the following plots: Residuals vs Fitted values, Normal Q-Q plot and Residuals vs Leverage.
par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))
A model my_model explaining exam points with attitude towards data science, use of deep learning strategies and use of surface learning strategies.
par(mfrow = c(2,2))
plot(my_best_model, which = c(1,2,5))
A model my_best_model explaining exam points with attitude towards data science.
As the model is a linear regression, it is assumed that the relationship between attitude and exam points is linear. From the scatter plots further above it can be seen that this is not exactly the case: there are some observations where exam points are exceptionally low regardless of attitude. One can imagine that these observations reflect some problems – other than attitude problems – in gathering exam points.
For details and links to further information about the data have a look at the wrangling script create_alc.R at my repository. The wrangled data is here.
The data has been gathered from students attending mathematics and Portuguese language courses in two Portuguese schools. The variables are explained at the source of the data under the title Attribute information. Originally the data was gathered to study the prediction of student performance in secondary education. In this analysis the focus is on student alcohol consumption. Let’s first read the data set into memory and have a look at the variable names and summaries of the variables.
#getwd()
#NOTE: some functions are commented out to make the knitted output more simple.
alc <- read.csv("data/student_alc_analysis.csv", header = TRUE)
#colnames(alc)
library(dplyr) #glimpse() comes from the dplyr package
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob <fctr> teacher, other, other, services, other, other, oth...
## $ reason <fctr> course, course, other, home, home, reputation, hom...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
summary(alc)
## school sex age address famsize Pstatus
## GP:342 F:198 Min. :15.00 R: 81 GT3:278 A: 38
## MS: 40 M:184 1st Qu.:16.00 U:301 LE3:104 T:344
## Median :17.00
## Mean :16.59
## 3rd Qu.:17.00
## Max. :22.00
## Medu Fedu Mjob Fjob
## Min. :0.000 Min. :0.000 at_home : 53 at_home : 16
## 1st Qu.:2.000 1st Qu.:2.000 health : 33 health : 17
## Median :3.000 Median :3.000 other :138 other :211
## Mean :2.806 Mean :2.565 services: 96 services:107
## 3rd Qu.:4.000 3rd Qu.:4.000 teacher : 62 teacher : 31
## Max. :4.000 Max. :4.000
## reason nursery internet guardian traveltime
## course :140 no : 72 no : 58 father: 91 Min. :1.000
## home :110 yes:310 yes:324 mother:275 1st Qu.:1.000
## other : 34 other : 16 Median :1.000
## reputation: 98 Mean :1.448
## 3rd Qu.:2.000
## Max. :4.000
## studytime failures schoolsup famsup paid activities
## Min. :1.000 Min. :0.0000 no :331 no :144 no :205 no :181
## 1st Qu.:1.000 1st Qu.:0.0000 yes: 51 yes:238 yes:177 yes:201
## Median :2.000 Median :0.0000
## Mean :2.037 Mean :0.2016
## 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :3.0000
## higher romantic famrel freetime goout
## no : 18 no :261 Min. :1.000 Min. :1.00 Min. :1.000
## yes:364 yes:121 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:2.000
## Median :4.000 Median :3.00 Median :3.000
## Mean :3.937 Mean :3.22 Mean :3.113
## 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :5.000 Max. :5.00 Max. :5.000
## Dalc Walc health absences
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 0.0
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 1.0
## Median :1.000 Median :2.000 Median :4.000 Median : 3.0
## Mean :1.482 Mean :2.296 Mean :3.573 Mean : 4.5
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 6.0
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :45.0
## G1 G2 G3 alc_use
## Min. : 2.00 Min. : 4.00 Min. : 0.00 Min. :1.000
## 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:1.000
## Median :12.00 Median :12.00 Median :12.00 Median :1.500
## Mean :11.49 Mean :11.47 Mean :11.46 Mean :1.889
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :18.00 Max. :18.00 Max. :18.00 Max. :5.000
## high_use
## Mode :logical
## FALSE:268
## TRUE :114
## NA's :0
##
##
g2 <- ggplot(data = alc, aes(x = high_use, fill = sex))
g2 + geom_bar() + facet_wrap("sex")
Bar plot of the frequencies of high alcohol use among females and males.
Below is a graphical summary of the variables (frequency bar plots). It is not perfect: for example, the x-axes of the variables absences, G1, G2 and G3 are not in the correct order; the values are presented in alphanumerical order (1, 10, 11, 12, …, 19, 2, 20, 21, …) instead of numerical order. Tip: right click the image of the plots and choose “Show image” to zoom closer.
library(tidyr) #gather() comes from the tidyr package
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", ncol = 3, scales = "free") + geom_bar()
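The ordering problem could probably be avoided by plotting only the numeric columns, so that gather() does not have to coerce the values to character; a sketch (select_if comes from dplyr):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

alc %>%
  select_if(is.numeric) %>% # keep only numeric columns
  gather() %>%              # values now stay numeric, so the x-axes sort correctly
  ggplot(aes(value)) + facet_wrap("key", ncol = 3, scales = "free") + geom_bar()
```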
Hypothesis based on the data below: being male increases the risk of adopting high alcohol consumption habits.
alc %>% group_by(high_use, sex) %>% summarise(count = n(), mean_grade = mean(G3))
## Source: local data frame [4 x 4]
## Groups: high_use [?]
##
## high_use sex count mean_grade
## <lgl> <fctr> <int> <dbl>
## 1 FALSE F 156 11.39744
## 2 FALSE M 112 12.20536
## 3 TRUE F 42 11.71429
## 4 TRUE M 72 10.27778
Hypothesis based on the data below: living in an urban area decreases alcohol consumption.
alc %>% group_by(high_use, address) %>% summarise(count = n(), mean_grade = mean(G3))
## Source: local data frame [4 x 4]
## Groups: high_use [?]
##
## high_use address count mean_grade
## <lgl> <fctr> <int> <dbl>
## 1 FALSE R 50 10.40000
## 2 FALSE U 218 12.04128
## 3 TRUE R 31 10.74194
## 4 TRUE U 83 10.83133
Hypothesis based on the data below: Getting educational support from family decreases alcohol consumption.
alc %>% group_by(high_use, famsup) %>% summarise(count = n(), mean_grade = mean(G3))
## Source: local data frame [4 x 4]
## Groups: high_use [?]
##
## high_use famsup count mean_grade
## <lgl> <fctr> <int> <dbl>
## 1 FALSE no 98 11.85714
## 2 FALSE yes 170 11.66471
## 3 TRUE no 46 10.47826
## 4 TRUE yes 68 11.02941
Hypothesis based on the data below: Attending paid extra classes decreases alcohol consumption.
alc %>% group_by(high_use, paid) %>% summarise(count = n(), mean_grade = mean(G3))
## Source: local data frame [4 x 4]
## Groups: high_use [?]
##
## high_use paid count mean_grade
## <lgl> <fctr> <int> <dbl>
## 1 FALSE no 148 11.26351
## 2 FALSE yes 120 12.31667
## 3 TRUE no 57 10.66667
## 4 TRUE yes 57 10.94737
Note that all four explanatory variables used above happened to be binary. Drawing box plots with whiskers would also have revealed more details about the variation among these groups.
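A sketch of one such box plot, comparing the distributions of final grades (G3) between the consumption groups, separately for females and males:

```r
library(ggplot2)
g_box <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g_box + geom_boxplot() + ylab("final grade (G3)") +
  ggtitle("Final grade by alcohol consumption and sex")
```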
To be continued…
#NOTE: some functions are commented out to make the knitted output more simple.
library(MASS) #the Boston dataset comes from the MASS package
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The data used in this exercise is called Boston and comes from the R library MASS. The data frame concerns housing values in the suburbs of Boston; it has 506 observations of 14 variables, which are explained on this webpage. In short, the variables describe crime rate, urban planning, housing, air pollution, economy, school environment and skin colour. The variables chas (Charles River dummy variable, 1 if the tract bounds the river and 0 otherwise) and rad (index of accessibility to radial highways) are integers; the others are numerical.
Summaries of the variables are below:
library(GGally)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Scatter plots of all the variables are presented in a matrix below. Tip: Right click the image of plots and choose “Show image” to zoom closer.
library(ggplot2)
p41 <- ggpairs(Boston, lower = list(combo = wrap("facethist", bins = 20)))
p41
Most of the variables are far from normally distributed. For example, the histogram below clearly shows a couple of suburbs with extreme per capita crime rates:
p4_crime <- ggplot(Boston, aes(crim)) + geom_histogram(binwidth = 1, fill="purple", colour="red") + ggtitle("Histogram of per capita crime rate by town")
p4_crime
korrelaatiot <- cor(Boston)
Let’s have a look at the correlations. They can be calculated with korrelaatiot <- cor(Boston) (as above) and inspected in RStudio’s window view (View(korrelaatiot)). The most interesting correlations are not always the ones with the highest R-squared. So, obviously, it would have been a lot easier to use the package corrplot and a function:
corrplot(korrelaatiot, type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6, method="circle")
Unfortunately, the corrplot package is not available for the R version 3.2.5 that I am using.
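Even without corrplot, the strongest pairwise correlations can be listed with base R; a minimal sketch (the 0.7 threshold is chosen just for illustration):

```r
korrelaatiot <- cor(Boston)
korrelaatiot[upper.tri(korrelaatiot, diag = TRUE)] <- NA # keep each pair only once
strong <- which(abs(korrelaatiot) > 0.7, arr.ind = TRUE) # indices of strong correlations
data.frame(var1 = rownames(korrelaatiot)[strong[, 1]],
           var2 = colnames(korrelaatiot)[strong[, 2]],
           r = korrelaatiot[strong])
```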
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Now the values are centred around the means and scaled by dividing by the standard deviation, to make the values proportional to the variability.
#boston_scaled is a matrix object, as the function class(boston_scaled) would tell. Let's change it to a data.frame:
boston_scaled <- as.data.frame(boston_scaled)
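What scale() does can be verified by hand for a single variable; a sketch:

```r
# scale() subtracts the column mean and divides by the column standard deviation
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual_crim, boston_scaled$crim) # should be TRUE
```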
In LDA the target variable has to be categorical. So let’s create a categorical crime rate variable (crime) by cutting the original by quantiles to get the high, low and middle rates of crime:
scaled_crim <- boston_scaled$crim
bins <- quantile(scaled_crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <- cut(scaled_crim, breaks = bins, label = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
#the following shows the table of the new factor crime:
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
##remove original crim from the dataset (note that here "select" is used from a package dplyr)
#str(boston_scaled)
boston_scaled <- dplyr::select(boston_scaled, -crim)
#then let's insert the new variable to data.frame:
boston_scaled <- data.frame(boston_scaled, crime)
The standardized dataset is then randomly split into a train set (80 % of observations) and a test set (20 %).
n <- nrow(boston_scaled)
# choose randomly 80% of the n rows and save those rows to ind
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
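Note that sample() draws a different random split on every run; if reproducibility is wanted, the random seed can be fixed first (a sketch, the seed value is arbitrary):

```r
set.seed(2017) # any fixed number makes the split reproducible
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
```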
Below, a linear discriminant analysis is fitted on the train set. Crime is the target variable and all the other variables are predictors. As the predictor variables should be continuous, the variable chas (originally 0 or 1, i.e. a dichotomous/binary dummy variable) has to be removed:
str(train)
## 'data.frame': 404 obs. of 14 variables:
## $ zn : num 0.456 1.228 -0.487 -0.487 -0.487 ...
## $ indus : num -0.769 -1.441 -1.202 1.015 1.231 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 3.665 ...
## $ nox : num -1.067 -1.085 -0.947 0.512 0.434 ...
## $ rm : num 2.81 0.292 -0.173 -0.259 2.975 ...
## $ age : num -2.1377 -0.8588 0.0364 0.5871 0.8997 ...
## $ dis : num 2.428 2.373 -0.142 -0.842 -0.776 ...
## $ rad : num -0.293 -0.982 -0.867 1.66 -0.522 ...
## $ tax : num -0.4642 -0.4345 -0.7846 1.5294 -0.0311 ...
## $ ptratio: num 0.298 0.575 -0.21 0.806 -1.735 ...
## $ black : num 0.441 0.441 0.385 -3.879 0.348 ...
## $ lstat : num -1.276 -0.934 -0.184 1.49 -1.307 ...
## $ medv : num 2.2036 0.0399 -0.1232 -0.993 2.9865 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 3 1 2 4 3 4 4 3 1 3 ...
train <- dplyr::select(train, -chas)
lda.fit <- lda(crime ~ ., data = train)
#The dot above means "all the other variables"
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2376238 0.2673267 0.2500000 0.2450495
##
## Group means:
## zn indus nox rm age dis
## low 0.8732156 -0.9164840 -0.8817781 0.5060559 -0.8456494 0.8269063
## med_low -0.0803044 -0.3391901 -0.6062560 -0.1074094 -0.3975851 0.3873462
## med_high -0.3666748 0.1273574 0.3801208 0.1316003 0.3970176 -0.3559158
## high -0.4872402 1.0171737 1.0097019 -0.3747289 0.8088257 -0.8571140
## rad tax ptratio black lstat
## low -0.6923618 -0.7333181 -0.4447342 0.37483089 -0.773642088
## med_low -0.5511961 -0.5120651 -0.1013524 0.32526317 -0.176652892
## med_high -0.4212828 -0.3318570 -0.3892303 0.04303245 -0.005558964
## high 1.6375616 1.5136504 0.7801170 -0.84163619 0.883461925
## medv
## low 0.55254099
## med_low 0.02593099
## med_high 0.21432334
## high -0.65265145
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.04576121 0.61227530 -1.0154532
## indus 0.01570481 -0.27662201 0.4319820
## nox 0.37396615 -0.90435804 -1.1862094
## rm -0.13875551 -0.06473452 -0.2161885
## age 0.15591631 -0.31959112 -0.1774966
## dis -0.06432416 -0.28748760 0.2039174
## rad 3.35426144 0.82190956 0.0131156
## tax 0.07456381 0.16654208 0.3033713
## ptratio 0.12484773 0.04078448 -0.2932809
## black -0.12601283 0.03791009 0.1568363
## lstat 0.24686033 -0.17733811 0.3378243
## medv 0.23409027 -0.36913493 -0.1616292
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9554 0.0329 0.0118
In the LDA output, the last three rows (proportion of trace) show that linear discriminant 1 (LD1, the linear combination of variables that best separates the target variable classes/categories) explains about 95.5 % of the between-group variance of the crime rate categories (per capita crime rate by town).
Below is the LDA biplot for the train set:
#To draw a LDA biplot, the following has to be done:
#Numeric crime classes:
train_classes <- as.numeric(train$crime)
# the biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
#The biplot:
plot(lda.fit, dimen = 2, col = train_classes, pch = train_classes)
lda.arrows(lda.fit, myscale = 2)
From the picture above, it looks like rad (“index of accessibility to radial highways”) is strongly related to the high crime rate category (a high correlation, 0.626, as can be read from the scatter plot matrix further above). This is because the angle between the rad arrow and the LD1 axis is small. The length of the rad arrow indicates that rad also has a large coefficient on the discriminants, so it contributes strongly to the separation.
At first it felt obscure why the towns with the highest crime rates also have the (second) highest values of the tax variable. An excerpt from the original article:
Full value property tax rate ($/$10,000) measures the cost of public services in each community. Nominal tax rates were corrected by local assessment ratios to yield the full value tax rate for each town. The coefficient of this variable should be negative.
Meaning that the lower the tax, the higher the housing values (medv) in the area. The researchers also assumed that the higher the rad, the lower the housing values.
Next, let’s do some model validation using the test dataset: we check how accurately the fitted LDA model predicts new observations. You can find help in R using ?predict.lda.
#As crime is the target variable, so let's remove that one from the test dataset. First let's save the correct crime classes from test data
test_classes <- test$crime
#then we remove the crime and dummy chas variables from test data
test <- dplyr::select(test, -crime)
test <- dplyr::select(test, -chas)
#str(test)
#Note: the model is the first argument in the predict function.
lda.pred <- predict(lda.fit, newdata = test)
#str(lda.pred)
# cross tabulate the results
table(correct = test_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 19 11 1 0
## med_low 2 12 4 0
## med_high 0 7 16 2
## high 0 0 0 28
From the cross tabulation above it can be seen that the fitted LDA model (lda.fit) performs very well for the high category, but in the other crime rate categories sizeable proportions (roughly 25–50 %) of the predictions were wrong.
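The overall performance can also be summarised as a single accuracy number from the same predictions; a sketch:

```r
# proportion of correctly classified observations in the test set
mean(test_classes == lda.pred$class)
# misclassification rate
mean(test_classes != lda.pred$class)
```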
This was not included here.
Check out the link at the end of exercise 4.
To be published on 10th Dec. Edit: unfortunately this exercise had to be left without the analysis part.